Geographical analysis of media flows
A multidimensional approach
Introduction
1 DATA COLLECTION
1.1 Importation of RSS
1.1.1 The Mediacloud database
(tbd : presentation of the MediaCloud project)
Mediacloud can be freely used by researchers. All you have to do is create an account at the following address:
https://explorer.mediacloud.org
There are several ways to retrieve news titles. We focus here on a simple example of data obtained through the Mediacloud interface. Suppose that you want to extract the news published by Tunisian newspapers that speak about Europe.
1.1.2 Selection of media with source manager
We use the application called Source Manager and search by collection, which is the most convenient way to explore what is available for a country. In our example, the target country is Tunisia and three collections are proposed:
We have selected the collection named “Tunisia National” because we are interested in the most important newspapers of the country.
The bubble chart on the right immediately shows which media outlets have produced the most news, but it is wise to explore in more detail the list on the left, which gives the starting date of data collection for each outlet.
When a media outlet looks interesting, we click on its name to obtain a brief summary of its metadata. For example, in the case of L’économiste Maghrebin, the metadata indicate:
The media outlet looks promising, but before going further it is worth having a look at its website to get a more concrete idea of its content and ideological orientation, if we do not know them in advance.
Here we can see that this is an economic journal, published in French, with news organized in concentric geographic circles (Nation > Maghreb > Africa > World), which is precisely what we are looking for in the IMAGEUN project. We will complete the information about this media outlet later; before that, we have to check in more detail whether its production is regular through time, using another tool offered by Mediacloud, the Explorer.
1.1.3 Checking the stability through time
We click on search in explorer on the metadata page of the Source Manager and obtain a new interface, where we modify the dates to cover the full collection period of the media outlet (or our period of interest). In the search field, we leave the search term *, which means a search on all news.
Below your query, you obtain a chart entitled Attention Over Time with the distribution of the number of news items published per day, which helps you verify whether the flow of news is regular through time. You just have to change the chart type to Story Count, and you can choose the time span you want (day, week or month) to evaluate the regularity of the news flow. In our example, we notice at the daily level some brief breaks in 2019, but the flow is reasonably regular, with approximately 5 news items per day at the beginning and 10 to 20 in the final period. We also notice a classical weekly cycle, with fewer news items published during the weekend.
Further down, you will find a new panel entitled Total Attention, which gives the total number of stories found. In our example, we have a total of 13626 stories produced by our media outlet over the period.
1.1.5 Download and storage of news
Depending on your selection (all news or a specific topic), you will download more or fewer titles. Here, we choose to get all news, which means that we have to repeat the original query with *.
Finally, by clicking on the button Download all story URLS, you can get a .csv file that you can easily load in your favorite programming language as we will see in the next section.
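The regularity check of section 1.1.3 can also be reproduced offline from the downloaded file. The following is a minimal sketch in base R, not part of the original workflow: it assumes that the file contains a publish_date column with time stamps such as 2019-01-02 03:42:46, and the stories data frame is a hypothetical stand-in for the real data.

```r
# Toy data standing in for the downloaded file (hypothetical values)
stories <- data.frame(
  publish_date = c("2019-01-02 03:42:46", "2019-01-02 04:06:27",
                   "2019-01-03 10:05:06", "2019-01-05 07:52:36",
                   "2019-01-05 08:57:54", "2019-01-05 09:12:00"),
  stringsAsFactors = FALSE
)

# Keep only the calendar day of each time stamp
stories$day <- as.Date(substr(stories$publish_date, 1, 10))

# Count stories per day, including the days without any story
all_days <- seq(min(stories$day), max(stories$day), by = "day")
per_day <- table(factor(as.character(stories$day),
                        levels = as.character(all_days)))

# Days with no story at all point to a break in the collection
gaps <- names(per_day)[per_day == 0]

# Mean number of stories by day of the week (1 = Monday ... 7 = Sunday)
dow <- as.integer(format(all_days, "%u"))
per_dow <- tapply(as.vector(per_day), dow, mean)
```

On real data, gaps lists the days on which the collection was interrupted, and per_dow makes the weekend decrease visible.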
1.2 Corpus creation
knitr::opts_chunk$set(cache = TRUE,
                      echo = TRUE,
                      comment = "")
In the previous section (ref…) we obtained a .csv file of news collected from MediaCloud. We will now put these data in a standard form, and we have chosen the format of the quanteda package as the reference for data organization and storage.
Of course, the researchers involved in the project may prefer to use other R packages like tm or tidytext, or another programming language such as Python. This is why we explain how to transform and export the data prepared and harmonized with quanteda into various formats like .csv or JSON.
We detail here an example of importation for the newspaper “L’économiste maghrebin”.
1.2.1 Importation of text to R
This step is not always straightforward, because many encoding problems can appear that are more or less easy to solve. In principle, the data from Media Cloud are exported in standard UTF-8 but, as we will see, this is not necessarily the case.
We first try the standard R function read.csv():
store <- "data"
media <- "fr_TUN_ecomag"
type <-".csv"
fic <- paste(store,"/",media,type,sep="")
df <- read.csv(fic,
               sep = ",",
               header = TRUE,
               encoding = "UTF-8",
               stringsAsFactors = FALSE)
library(knitr)  # provides kable()
kable(head(df))

| stories_id | publish_date | title | url | language | ap_syndicated | themes | media_id | media_name | media_url |
|---|---|---|---|---|---|---|---|---|---|
| 1129295780 | 2019-01-02 03:42:46 | Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 | https://www.leconomistemaghrebin.com/2019/01/02/tarifs-adsl-reduits-1-janvier-2019/ | fr | False | | 623820 | L’Economiste Maghrebin | http://www.leconomistemaghrebin.com/ |
| 1129295771 | 2019-01-02 04:06:27 | 6ème Sfax Marathon International des Oliviers | https://www.leconomistemaghrebin.com/2019/01/02/sfax-marathon-international-oliviers/ | fr | False | | 623820 | L’Economiste Maghrebin | http://www.leconomistemaghrebin.com/ |
| 1129295760 | 2019-01-02 06:05:08 | Télécharger la version finale de la Loi de finances 2019 | https://www.leconomistemaghrebin.com/2019/01/02/telecharger-la-version-finale-de-la-loi-de-finances-2019/ | en | False | | 623820 | L’Economiste Maghrebin | http://www.leconomistemaghrebin.com/ |
| 1129578051 | 2019-01-02 10:05:06 | Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public | https://www.leconomistemaghrebin.com/2019/01/02/chawki-tabib-245-dossiers-transferes-au-ministere-public/ | fr | False | | 623820 | L’Economiste Maghrebin | http://www.leconomistemaghrebin.com/ |
| 1129461662 | 2019-01-02 07:52:36 | Panoro Energy finalise l’acquisition de OMV Tunisia | https://www.leconomistemaghrebin.com/2019/01/02/panoro-energy-finalise-lacquisition-de-omv-tunisia/ | fr | False | | 623820 | L’Economiste Maghrebin | http://www.leconomistemaghrebin.com/ |
| 1129461636 | 2019-01-02 08:57:54 | La partie syndicale maintient le boycott des examens du secondaire | https://www.leconomistemaghrebin.com/2019/01/02/partie-syndicale-boycott-examens-secondaire/ | fr | False | | 623820 | L’Economiste Maghrebin | http://www.leconomistemaghrebin.com/ |
The importation was successful for 12794 news items, but R raised an error for 3 items, with the following message:
Error in gregexpr(calltext, singleline, fixed = TRUE) : regular expression is invalid UTF-8
Looking in more detail, we also discover encoding problems in some news items, as in the following example, where the title is displayed differently depending on whether we use the standard function paste() or the specialized printing function knitr::kable().
paste(df[9, 3])
[1] "Néji Jalloul : “Nidaa Tounes peut revenir si…”"
kable(df[9, 3])

| x |
|---|
| Néji Jalloul : “Nidaa Tounes peut revenir si…” |
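Before correcting anything by hand, it can help to locate the faulty entries automatically. The sketch below uses the base R functions validUTF8() and iconv(); the titles vector is hypothetical, and the assumption that the faulty entries were written in Latin-1 is only a guess that has to be checked against the real file.

```r
# Hypothetical titles: the second one carries a Latin-1 byte (0xE9 = "é")
# that is not valid UTF-8
titles <- c("Loi de finances 2019",
            rawToChar(as.raw(c(0x4e, 0xe9, 0x6a, 0x69))))

# Flag the entries that are not valid UTF-8
bad <- !validUTF8(titles)

# Re-encode only the faulty entries, assuming they were written in Latin-1
fixed <- titles
fixed[bad] <- iconv(titles[bad], from = "latin1", to = "UTF-8")
```

Applied to the title column of the imported data frame, which(bad) would give the rows to inspect.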
1.2.2 Resolution of encoding problems
It is sometimes possible to fix encoding problems manually when they are not too numerous, as in the present example.
df$text<-df$title
# standardize apostrophe
df$text<-gsub("’","'",df$text)
# standardize punct
df$text<-gsub('…','.',df$text)
# standardize hyphens
df$text<-gsub('–','-',df$text)
# Remove quotation marks
df$text<-gsub('« ','',df$text)
df$text<-gsub(' »','',df$text)
df$text<-gsub('“','',df$text)
df$text<-gsub('”','',df$text)
df$text<-gsub('‘','',df$text)
df$text<-gsub('″','',df$text)
We can introduce other cleaning procedures here or keep them for later analysis.
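When the same cleaning has to be applied to several newspapers, the substitutions can be gathered in a small function. This is only a sketch: clean_title(), from and to are hypothetical names, the characters are written as Unicode escapes so that the script does not depend on the encoding of the source file, and fixed = TRUE is used because none of the patterns needs a regular expression.

```r
# Characters to find and their substitutes, in the same order as above:
# apostrophe, ellipsis, en dash, then the quotation marks to remove
from <- c("\u2019", "\u2026", "\u2013", "\u00ab ", " \u00bb",
          "\u201c", "\u201d", "\u2018", "\u2033")
to   <- c("'", ".", "-", "", "", "", "", "", "")

# Apply every substitution in turn to a character vector
clean_title <- function(x, from, to) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Usage on an invented title
example <- clean_title("\u00ab L\u2019Europe \u2013 et apr\u00e8s\u2026 \u00bb",
                       from, to)
```

With the real data, df$text <- clean_title(df$title, from, to) should give the same result as the sequence of gsub() calls.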
1.2.3 Transformation in quanteda format
We propose a storage based on the quanteda format, obtained by simply transforming the data frame imported in the previous step. We keep only the name of the source and the date of publication.
library(quanteda)
# Create Quanteda corpus
qd<-corpus(df,docid_field = "stories_id")
# Select docvar fields and rename media
qd$date <-as.Date(qd$publish_date)
qd$source <-media
docvars(qd)<-docvars(qd)[,c("source","date")]
# Add global meta
meta(qd,"meta_source")<-"Media Cloud "
meta(qd,"meta_time")<-"Download the 2021-09-30"
meta(qd,"meta_author")<-"Elaborated by Claude Grasland"
meta(qd,"project")<-"ANR-DFG Project IMAGEUN"
We have created a quanteda object with a lot of information stored in various fields. The structure of the object is the following:
str(qd)
 'corpus' Named chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" ...
- attr(*, "names")= chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
- attr(*, "docvars")='data.frame': 12794 obs. of 5 variables:
..$ docname_: chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
..$ docid_ : Factor w/ 12794 levels "1129295780","1129295771",..: 1 2 3 4 5 6 7 8 9 10 ...
..$ segid_ : int [1:12794] 1 1 1 1 1 1 1 1 1 1 ...
..$ source : chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
..$ date : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
- attr(*, "meta")=List of 3
..$ system:List of 6
.. ..$ package-version:Classes 'package_version', 'numeric_version' hidden list of 1
.. .. ..$ : int [1:3] 3 0 0
.. ..$ r-version :Classes 'R_system_version', 'package_version', 'numeric_version' hidden list of 1
.. .. ..$ : int [1:3] 4 1 0
.. ..$ system : Named chr [1:3] "Windows" "x86-64" "claude"
.. .. ..- attr(*, "names")= chr [1:3] "sysname" "machine" "user"
.. ..$ directory : chr "C:/git/geomedia"
.. ..$ created : Date[1:1], format: "2021-11-25"
.. ..$ source : chr "data.frame"
..$ object:List of 2
.. ..$ unit : chr "documents"
.. ..$ summary:List of 2
.. .. ..$ hash: chr(0)
.. .. ..$ data: NULL
..$ user :List of 4
.. ..$ meta_source: chr "Media Cloud "
.. ..$ meta_time : chr "Download the 2021-09-30"
.. ..$ meta_author: chr "Elaborated by Claude Grasland"
.. ..$ project : chr "ANR-DFG Project IMAGEUN"
We can look at the first titles with head():
kable(head(qd,3))

| | x |
|---|---|
| 1129295780 | Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 |
| 1129295771 | 6ème Sfax Marathon International des Oliviers |
| 1129295760 | Télécharger la version finale de la Loi de finances 2019 |
We can get meta information on each story with summary():
summary(head(qd,3))
Corpus consisting of 3 documents, showing 3 documents:
Text Types Tokens Sentences source date
1129295780 11 11 1 fr_TUN_ecomag 2019-01-02
1129295771 6 6 1 fr_TUN_ecomag 2019-01-02
1129295760 8 10 1 fr_TUN_ecomag 2019-01-02
We can get the meta information about the whole corpus with meta():
meta(qd)
$meta_source
[1] "Media Cloud "
$meta_time
[1] "Download the 2021-09-30"
$meta_author
[1] "Elaborated by Claude Grasland"
$project
[1] "ANR-DFG Project IMAGEUN"
1.2.4 Storage of the quanteda object
Finally, we can save the object in .RDS format in a directory dedicated to our quanteda files. It can be useful to include some information in the name of the file.
store <- "data"
type<- ".RDS"
myfile <- paste(store,"/",media,type,sep="")
myfile
[1] "data/fr_TUN_ecomag.RDS"
saveRDS(qd,myfile)
qd[1:3]
Corpus consisting of 3 documents and 2 docvars.
1129295780 :
"Les tarifs de l'ADSL réduits à partir du 1er janvier 2019"
1129295771 :
"6ème Sfax Marathon International des Oliviers"
1129295760 :
"Télécharger la version finale de la Loi de finances 2019"
summary(qd,3)
Corpus consisting of 12794 documents, showing 3 documents:
Text Types Tokens Sentences source date
1129295780 11 11 1 fr_TUN_ecomag 2019-01-02
1129295771 6 6 1 fr_TUN_ecomag 2019-01-02
1129295760 8 10 1 fr_TUN_ecomag 2019-01-02
We have kept the information we need from the initial file and added specific metadata of interest to us. The stored object now takes 0.6 Mb, roughly six times less than the initial .csv file downloaded from Media Cloud (3.8 Mb).
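The comparison of file sizes and the integrity of the saved object can be checked directly in R. The following sketch relies only on base functions and on a hypothetical toy data frame written to temporary files, so the ratio it produces says nothing about the real corpus, where part of the reduction presumably also comes from the columns that were dropped.

```r
# Hypothetical data frame standing in for the cleaned titles
toy <- data.frame(text = rep("Les tarifs de l'ADSL", 1000),
                  source = "fr_TUN_ecomag",
                  date = as.Date("2019-01-02"),
                  stringsAsFactors = FALSE)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".RDS")

write.csv(toy, csv_file, row.names = FALSE)
saveRDS(toy, rds_file)  # compressed by default

# Size of each file in megabytes, and the ratio between them
size_mb <- file.size(c(csv_file, rds_file)) / 1024^2
ratio <- size_mb[1] / size_mb[2]

# Reading the file back should return the same object
toy_back <- readRDS(rds_file)
same <- identical(toy, toy_back)
```

The same two calls, file.size() and readRDS(), can be applied to the real .csv and .RDS files.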
1.2.5 Back transformation to tibble
In the following steps, we will make intensive use of quanteda, but it can sometimes be useful to export the results in a more practical format or to use other packages. For this reason, it is important to know that the tidytext package can easily transform a quanteda object into a tibble, which is a more classical structure, easy to manage and to export to other formats like data.frame or data.table.
library(tidytext)
td <- tidy(qd)
kable(head(td))

| text | source | date |
|---|---|---|
| Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 | fr_TUN_ecomag | 2019-01-02 |
| 6ème Sfax Marathon International des Oliviers | fr_TUN_ecomag | 2019-01-02 |
| Télécharger la version finale de la Loi de finances 2019 | fr_TUN_ecomag | 2019-01-02 |
| Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public | fr_TUN_ecomag | 2019-01-02 |
| Panoro Energy finalise l’acquisition de OMV Tunisia | fr_TUN_ecomag | 2019-01-02 |
| La partie syndicale maintient le boycott des examens du secondaire | fr_TUN_ecomag | 2019-01-02 |
str(td)
tibble [12,794 x 3] (S3: tbl_df/tbl/data.frame)
$ text : chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" "6ème Sfax Marathon International des Oliviers" "Télécharger la version finale de la Loi de finances 2019" "Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public" ...
$ source: chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
$ date : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
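Once the corpus is available as a table with the columns text, source and date, exporting it to .csv or JSON takes a few lines. The sketch below is an assumption about how this could be done, not the procedure used in the project: td_toy is a hypothetical stand-in for td, the files are written to temporary locations, and the JSON export relies on the jsonlite package, which does not appear in the session information of this document.

```r
# Hypothetical table with the three columns shown above
td_toy <- data.frame(text = c("Titre 1", "Titre 2"),
                     source = "fr_TUN_ecomag",
                     date = as.Date(c("2019-01-02", "2019-01-03")),
                     stringsAsFactors = FALSE)

# CSV export, written explicitly as UTF-8
csv_out <- tempfile(fileext = ".csv")
write.csv(td_toy, csv_out, row.names = FALSE, fileEncoding = "UTF-8")

# JSON export, only if the jsonlite package is installed
json_out <- tempfile(fileext = ".json")
if (requireNamespace("jsonlite", quietly = TRUE)) {
  jsonlite::write_json(td_toy, json_out)
}

# Reading the CSV back: dates come back as character strings
csv_back <- read.csv(csv_out, encoding = "UTF-8", stringsAsFactors = FALSE)
csv_back$date <- as.Date(csv_back$date)
```

Both formats can then be read from Python or any other language without going through R.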
Appendices
Session info
| setting | value |
|---|---|
| version | R version 4.1.0 (2021-05-18) |
| os | Windows 10 x64 |
| system | x86_64, mingw32 |
| ui | RTerm |
| language | (EN) |
| collate | French_France.1252 |
| ctype | French_France.1252 |
| tz | Europe/Paris |
| date | 2021-11-25 |
| package | ondiskversion | source |
|---|---|---|
| dplyr | 1.0.6 | CRAN (R 4.1.0) |
| ggplot2 | 3.3.3 | CRAN (R 4.1.0) |
| knitr | 1.34 | CRAN (R 4.1.1) |
| quanteda | 3.0.0 | CRAN (R 4.1.0) |
| readtext | 0.80 | CRAN (R 4.1.0) |
| rmarkdown | 2.11 | CRAN (R 4.1.1) |
| rzine | 0.1.0 | gitlab (rzine/package@a94bf55) |
| tidytext | 0.3.1 | CRAN (R 4.1.1) |
| WikidataR | 2.3.1 | CRAN (R 4.1.1) |